From source code identifiers to natural language terms

Authors

  • Nuno Ramos Carvalho
  • José João Almeida
  • Pedro Rangel Henriques
  • Maria João Varanda Pereira
Abstract

Program comprehension techniques often explore program identifiers to infer knowledge about programs. The importance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single, or multiple, domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviation dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented. © 2014 Elsevier Inc. All rights reserved.
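To make the difference between hard splitting and dictionary-based splitting concrete, here is a minimal Python sketch. It is not the Lingua::IdSplitter implementation or its API: the toy dictionaries, the function names (hard_split, soft_split, split_and_expand), and the "fewest dictionary entries" segmentation rule are all assumptions chosen for illustration, and the paper's actual ranking of candidate splits may differ.

```python
import re

# Hypothetical toy dictionaries (assumptions); the paper's dictionaries are
# larger and partly generated from the software's own documentation.
WORDS = {"time", "sort", "get", "file", "name", "user"}
ABBREVIATIONS = {"fn": "function", "ptr": "pointer", "buf": "buffer", "str": "string"}


def hard_split(identifier):
    """Split on explicit marks only: underscores, digits, CamelCase boundaries."""
    parts = []
    for chunk in re.split(r"[_\d]+", identifier):
        # Either a run of capitals not followed by lowercase (an acronym),
        # or an optionally capitalised lowercase word.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [p.lower() for p in parts]


def soft_split(token, vocabulary):
    """Segment a token into the fewest vocabulary entries, or return it whole."""
    best = {0: []}  # best[i] holds the shortest segmentation found for token[:i]
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if start in best and piece in vocabulary:
                candidate = best[start] + [piece]
                if end not in best or len(candidate) < len(best[end]):
                    best[end] = candidate
    # If no full segmentation exists, keep the token unchanged.
    return best.get(len(token), [token])


def split_and_expand(identifier):
    """Hard split first, then soft split each token and expand abbreviations."""
    vocabulary = WORDS | set(ABBREVIATIONS)
    terms = []
    for token in hard_split(identifier):
        for piece in soft_split(token, vocabulary):
            terms.append(ABBREVIATIONS.get(piece, piece))
    return terms
```

With these toy dictionaries, hard_split("timesort") leaves the identifier as a single token because it carries no explicit marks, while split_and_expand("timesort") separates it into "time" and "sort", and split_and_expand("get_userbuf") also expands the abbreviation "buf" to "buffer".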


Similar resources

Probabilistic SynSet Based Concept Location

Concept location is a common task in program comprehension techniques, essential in many approaches used for software care and software evolution. An important goal of this process is to discover a mapping between source code and human oriented concepts. Although programs are written in a strict and formal language, natural language terms and sentences like identifiers (variables or functions n...


Identifying Idioms of Source Code Identifier in Java Context

This paper presents an approach to identifying a domain word POS(Part of Speech) and idiom code identifiers written in Java programming language. To detect them, we extracted common identifiers from 14 Java API documents, and applied diverse filters. In addition, NLP (Natural Language Parser) has been used to detect common mistakes in the Java API documents. As a result, this paper identified 8...


Supporting Concept Extraction and Identifier Quality Improvement through Programmers' Lexicon Analysis

Identifiers play an important role in communicating the intentions associated with the program entities they represent. The information captured in identifiers helps programmers (re-)build the “mental model” of the software and facilitates understanding. (Re-)building the “mental model” and understanding large software, however, is difficult and expensive. Besides, the effort involved in t...


The impact of vocabulary normalization

Software development, evolution, and maintenance depend on ever increasing tool support. Recent tools have incorporated increasing analysis of the natural language found in source code, predominately in the identifiers and comments. However, when coders combine abbreviations and acronyms to form multiword identifiers, they, in essence, invent new vocabulary making the source code’s vocabulary d...


Context Awareness for Effective Software Structure Quality

This paper presents an approach that helps developers keep source code identifiers and comments consistent with high-level artifacts. This approach calculates and shows the textual similarity between source code and related artifacts. The assumption is that developers are induced to improve the source code lexicon (terms) used in identifiers or comments. The software development environment provides i...



Journal:
  • Journal of Systems and Software

Volume 100

Publication date: 2015